Running head: SYMBOLICALLY SPEAKING

SUBMITTED TO COGNITIVE SCIENCE --- DO NOT CITE

Symbolically Speaking: A Connectionist Model of Sentence Production

Author

  • Franklin Chang
Abstract

The ability to combine words into novel sentences has been used to argue that humans have symbolic language production abilities. Critiques of connectionist models of language often center on the inability of these models to generalize symbolically (Fodor & Pylyshyn, 1988; Marcus, 1998). A connectionist model of sentence production is described which has variables that are inspired by distinctions in the visual system. For several tests of symbolic generalization, the model with variables is better able to generalize to novel sentences than a model without variables. The ability of the model to generalize in a symbolic manner is due to the interaction of the variables and the sequencing abilities of the model. The same model is then applied to problems in language acquisition and used to model dissociations in aphasia.

An important use of language is to be able to talk about novel events and circumstances. In order to do this, we need the ability to take the words that we know, and combine them in novel ways. Applying knowledge to a new situation involves generalizing that knowledge beyond the context in which it was originally learned. For example, we can use nouns in sentence frames that they have never been paired with before. If I teach you the count noun “blicket”, you can produce the sentence “A blicket is a blicket”, even though you have never heard “blicket” used in this manner. This ability to combine words and sentence frames in the absence of previous experience has led some researchers to argue that language requires symbolic capabilities, where knowledge about language is phrased in terms of variables and operations on those variables (Fodor & Pylyshyn, 1988; Marcus, 1998; Pinker & Prince, 1988).
In addition to evidence that supports symbolic processing, there is also research showing that people record the detailed statistical properties of the sentences that they hear and produce. One source of evidence for this is the role of frequency in language processing, where the frequencies of words and syntactic structures seem to influence the processing of language (Garnsey, Pearlmutter, Myers, & Lotocky, 1997; MacDonald, Pearlmutter, & Seidenberg, 1994). Another example is the fact that infants can acquire the probabilities of syllable transitions in sequences after only 2 minutes of exposure (Saffran, Aslin, & Newport, 1996). These infants were able to distinguish sequences that maintained the statistical structure of their earlier experience from sequences that violated it. These studies demonstrate that both adults and children are sensitive to statistical regularities when processing sequences. If the statistical structure is sufficiently rich, then when adults and children experience or generate novel language sequences, they can use the similarity of the novel sentences to sentences that they have previously produced or experienced to process these novel sequences.

Given that the language system seems to require both symbolic and statistical types of knowledge, theories have been developed that use separate mechanisms to implement these two types of processing; hence these theories have been called dual mechanism theories. One example of this type of theory concerns the processing of the English past tense. The English past tense has a regular form and several exceptional cases.
Pinker and Prince (1988) offer a dual mechanism account in which the regular form is handled by a symbolic mechanism (a rule that uses variables), and exceptional cases are handled by a mechanism that is sensitive to statistical regularities (spreading activation in a lexical network). Some theorists, however, have argued that statistical learning is powerful enough to explain both symbolic and statistical processing using a single mechanism (Rumelhart & McClelland, 1986). These single mechanism theories have been implemented in connectionist models (Plunkett & Juola, 1999), and the properties of these models (as surrogates for the theories) have come under scrutiny to see if they are appropriate for modeling human cognition.

There is some evidence that certain classes of connectionist models do not generalize in ways that people do. For example, Marcus (1998) found that a simple recurrent network (SRN) could learn equivalence relations like “A rose is a rose” or “A tulip is a tulip”, but when given a novel sentence like “A blicket is a ...”, the SRN could not predict that “blicket” was going to be the next word, even though it had learned other equivalence relationships. Given that humans can complete the sentence with “blicket” (when instructed to do so), this suggests that SRNs do not learn frames like “a X is a X”, where X is a variable that can be bound to any word. This limitation is important, because SRNs have been used extensively for modeling sentence processing (Christiansen & Chater, 1999; Elman, 1990; St. John & McClelland, 1990). While SRNs seem to be limited in this fashion, these limitations do not necessarily apply to all connectionist models. In particular, connectionist models have been developed that are able to learn rules that make use of variables (Shastri & Ajjanagadde, 1993). Also, the brain clearly is a connectionist network that has learned to generalize in a symbolic way.
Given this perspective, the task for adherents of connectionist networks is to figure out how to make them operate symbolically to the extent that humans operate symbolically. While there are various definitions of symbolic computation, most require that the symbol processor be able to bind instances to variables and use those variables in combinatorial structures like sentence frames (Fodor & Pylyshyn, 1988). In this paper, I will attempt to show that a connectionist model can learn to act in a symbolic fashion, by training a simple recurrent network to operate on variables. Both symbolic and statistical processing will arise out of this combination, because the network will record sequential statistical regularities, and these sequential regularities can include variables, which allows the model to generalize to all entities that can be linked to the variables. The experiments in this paper will concentrate on showing that the model can generalize nouns to novel frames. I will also demonstrate that this ability is dependent on features of the architecture and relational knowledge in the model. Whereas nouns can generalize fairly widely, I will show that verb generalization is more constrained, illustrating the statistical experience of the model. Altogether, it will be argued that the model's ability to generalize arises from statistical learning of language regularities in a system that allows the model to generalize in a symbolic fashion.

This paper will present a comparison of four model architectures. The common assumptions of the four models will first be described, and then the differences between the architectures will be presented. Three specific tests of novel word-structure pairings will be used to show that the models differ in their ability to generalize in a symbolic fashion. For the rest of the analyses, only the model that performs best at symbolic generalization will be examined.
To better understand how the model works, the hidden units will be examined in greater detail. This model will then be tested to see if it can constrain overgeneralization with verbs, a classic problem for a language acquisition system. Finally, the model will be lesioned to see if it breaks down in ways that are similar to human patients with aphasic symptoms.

Message Structure and Sentence Grammar

Speaking involves mapping from a set of ideas (which will be called the message) to a sequence of words (Bock & Levelt, 1994; Garrett, 1988). To learn this mapping, children must be exposed to sentences in situations where they can infer the message. Language researchers assume that children implicitly learn the internal representations that help them to map between messages and sentences, and these representations allow them to produce novel sentences (Pinker, 1989). To simulate this language learning process in training the models, I created a set of training sentences sampled from a grammar. The model learns the rules of the grammar from the limited number of training sentences, and exhibits that knowledge by correctly producing other sentences that have been generated from the grammar. The grammar was designed to enable the testing of several phenomena from the psychological literature on sentence production. Table 1 shows the types of sentences in the model's grammar. The grammar did not include subject-verb agreement or other verb inflections, because the phenomena under examination did not require these morphemes and eliminating them made the model simpler.

When creating a data set for training or testing, a set of messages was first generated. The messages defined only the propositional content of the target sentence, and did not encode the actual surface structure of the sentence. Each message was created by selecting an action and entities that were appropriate to the action.
For example, the action EAT was paired with an entity that was living (the eater) and an object that was not living and not a liquid (the object of eating). This representation was then used to select lexical items that matched the constraints of the action. So, with the action EAT, the eater could be “man” and the object could be “cake”. No attempt was made to make the verb constraints match the real complexity of these constraints in natural language.

The participants in an event were classified into one of three event roles: agent, patient, and goal. The agent was the cause of the action, the goal was the final location for the object, and the patient was the object in motion or the affected object. The patient also operated as a default role for several constructions. The roles did not exactly match the traditional definitions of these roles, but instead were designed to increase the generalization capabilities of the model (see Dowty, 1991, for arguments about why traditional roles do not work). For example, the distinction between themes and patients was collapsed into the role of patient. Location arguments are not always goals, but they were collapsed into that category for the model. The distinctions between the categories that were collapsed together in the model were expressed with verb-specific semantic information.

The model's lexicon was made up of 20 verbs, 22 nouns, 8 prepositions, 2 determiners, 11 adjectives, and an end-of-sentence marker. Eight of the nouns were animate, and fourteen were inanimate. The verb types included dative (give, throw, make, bake), transitive (hit, build, eat, drink, surprise, scare), change-of-state (fill), locative-alternation (spray, load), cause-motion (put, pour), intransitive (sleep, dance), motion (go, walk), and existence (is). Generally speaking, each verb had an equal probability of being in the training and testing sets.
But because existence and intransitive verbs were easy to learn, their proportion in the training and testing sets was reduced, to give the other verbs more training (see appendix for details).

For training, each message was paired with a particular sentence structure. Natural languages very often allow a particular meaning to be expressed with several alternative structures (see Table 1). For example, active and passive voice sentences have similar meanings, but differ in the order of the noun phrases and their structural properties. Another alternation in the model was the dative alternation, where the prepositional dative and the double object dative can express the same meanings. This alternation occurred with both transfer datives (give, throw) and benefactive datives (transitive verbs). The last alternation, the locative alternation, varies the order of the patient (object in motion) and the goal (final location). The pairing of messages with structures was arranged so that eighty percent of transitive sentences were paired with the active voice, and the rest with passives. For datives and locative-alternating structures, each alternative occurred approximately fifty percent of the time. To create some extra variability in the structures that were produced, these percentages were modified by the animacy of the arguments, so that animate nouns would tend to go before inanimate nouns 20% of the time (in structures that could alternate). The frequencies of structures in the model vastly oversimplified the real frequencies of these structures in the world, but maintained some of the character of the real frequencies within alternations.

Sentence structures in the model are not just formal patterns; they also convey meaning. This pairing of form and meaning, called a construction, has been argued to be the basis for how people generalize sentence structures (Goldberg, 1995).
In the model, each sentence construction had a particular set of relational features. These features, which I refer to as construction features (Table 2), identify similarities among constructions and thus helped the models generalize from one construction to a related construction. For example, the intransitive motion construction (e.g., “The girl goes to the café”) is related to the cause-motion construction (e.g., “The woman put the dog onto the table”), because the “girl” and the “dog” are both undergoing motion. Both constructions were assumed to share the feature MOTION. The cause-motion construction was also related to the transfer construction (e.g., “The man gives the dog to the girl”), because they shared both the features CAUSE and MOTION. It is thought that children and adults are able to generalize their use of word order based on abstract features such as these (Fisher, Gleitman, & Gleitman, 1991; Goldberg, 1995; Gropen, Pinker, Hollander, & Goldberg, 1991; Gropen, Pinker, Hollander, Goldberg, & Wilson, 1989).

In generating the training and testing sentences, the construction features were used to determine which constructions could alternate. In order to alternate, the construction had to be related to two alternative structures by means of these features (Goldberg, 1995). For example, the dative alternation could use the double object structure (e.g., “The man gave the girl the book.”), because this was designated as the default structure for this construction. But because the transfer dative construction contained the features CAUSE and MOTION, it could also use the prepositional dative structure (e.g., “The man gave the book to the girl.”). The locative alternation arose because these events fit both cause-motion frames (“...spray the water onto the wall”), which were licensed by the MOTION feature, and change-of-state frames (“spray the wall with water”), which were licensed by the CHANGE feature.
The passive structure was allowed to alternate with all transitives.

Although construction features in the intended message can influence the sentence structure that is chosen, speakers can also choose a structure based on other factors. In production studies, people are told to repeat sentences as they hear them, and for the most part, they are able to do this. Here, some verbatim memory of the structure is guiding the choice. But when doing repetition, people also frequently change the structure of their sentences (Potter & Lombardi, 1992). What this suggests is that there is some information in the message that allows people to control their structure building, but this information is weak enough that sometimes it is overcome by other factors (Bock, 1986; Bock & Warren, 1985). To represent this weak control information, the model made use of the relative activation level of the construction features. Consider the active-passive alternation. For passives, the AFFECTED feature would be more active than the CAUSE feature, and vice versa for active sentences. For datives, if the TRANSFER feature was more active than the MOTION feature, then a double object was produced; otherwise a prepositional dative was produced. For locatives, if the feature CHANGE was more active than MOTION, then a location-patient sentence was produced; otherwise a patient-locative was produced. To set up these differences, I used a prominence parameter (0.8), which controlled how different the activation levels of the construction features were. So, the activation of the feature MOTION was 80% of the activation of the TRANSFER feature if a double object structure was desired.

How do speakers select between alternations in production? Experimental work in sentence production has shown that speakers plan their sentences incrementally, adjusting their structures to fit the words that have come before (Bock, 1982; Bock, 1986; Bock & Warren, 1985).
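As an illustration, the relative-activation scheme for choosing between alternations can be sketched in Python. This is a hypothetical sketch, not the model's actual code: the feature and structure names follow the text, but the helper functions (`set_choice`, `choose_structure`) and the base activation value are invented for illustration.

```python
PROMINENCE = 0.8  # the less prominent feature gets 80% of the other's activation

def set_choice(stronger, weaker, base=1.0):
    """Encode a desired structure as relative construction-feature
    activations, e.g. TRANSFER > MOTION for a double object dative."""
    return {stronger: base, weaker: base * PROMINENCE}

def choose_structure(alternation, acts):
    """Read the structure back off the relative activations, using the
    three comparisons described in the text."""
    a = acts.get
    if alternation == "transitive":
        return "passive" if a("AFFECTED", 0.0) > a("CAUSE", 0.0) else "active"
    if alternation == "dative":
        return ("double object" if a("TRANSFER", 0.0) > a("MOTION", 0.0)
                else "prepositional dative")
    if alternation == "locative":
        return ("location-patient" if a("CHANGE", 0.0) > a("MOTION", 0.0)
                else "patient-locative")
    raise ValueError("unknown alternation: %s" % alternation)
```

For example, `choose_structure("dative", set_choice("TRANSFER", "MOTION"))` yields the double object structure, while reversing the prominence yields the prepositional dative.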
To create this ability in the model, I used the previously produced word as feedback to let the model know what structure should be chosen. But if the model only gets feedback from the previously produced word, then as the model is learning to produce the correct output, it is using its often incorrect output as feedback, and that makes it even harder to learn to produce the correct output. To deal with this incorrect feedback, I also gave the models training with sentences where the model always received the correct previous word as feedback. One way to think about these different types of feedback is that when the model gets its own previous word, it is producing a sentence by itself. But when it gets the correct previous word, it is doing implicit prediction as it comprehends someone else's sentence. Twenty-five percent of the training set used the previously produced word as feedback, while the other seventy-five percent of the training set used the previous correct target word. So the model experiences “comprehension” three times as often as production. When the model was tested, it was always given the previously produced word instead of the correct previous word, because we were interested in the model's ability to do sentence production.

The sentence grammar was used to generate 501 training sentences. In order to test the model's ability to generalize, the training sentences had one extra restriction: the word “dog” could never be the goal of the sentence. By testing the model's ability to produce “dog” as the goal of a sentence, even though it was never trained to do so, we can see how well the model generalizes outside of the regularities in the training set. To test overall generalization, a test set was created with 2,000 randomly generated sentences from the grammar.
Because the grammar can generate 75,330 possible messages (not including surface form alternations), this testing set is mostly made up of novel sentences, and therefore can provide a good picture of the overall accuracy of the model.

Symbolic Generalization in Different Architectures

In this section, several network architectures will be described and compared. The first architecture (Prod-SRN model) embodies the hypothesis that human generalization is simply due to learning the appropriate statistical representation. This model will be compared with a model whose architecture includes a symbolic processing system (Dual-path model) to see if a symbol processing system helps the model to generalize. The Dual-path model will be compared with a similar model (No-construction model) to see if the construction representations are crucial for generalization. And then, to test whether the two separate pathways in the Dual-path model are crucial to its performance, I will compare it to a model that links the two pathways (Linked-path model).

In addition to the same training and testing sets, the models also shared a few other features. All the models had to have some way of representing the message, and once the message was set, there was no external manipulation of the message. All the models were taught to produce words as output, where a single unit represented each word. To increase the models' tendency to choose a single word, the output units employed a soft-max activation function that magnified activation differences (see appendix for further details). All the models were trained using backpropagation of error, a learning algorithm that computes the difference between the target representation and the model's output and then passes this information back through the network in order to guide weight changes (Rumelhart, Hinton, & Williams, 1986).
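The soft-max output function mentioned above can be sketched as follows. This is a minimal illustration, not the models' implementation (which is specified in the appendix); in particular, the `gain` parameter and its value are assumptions used to show how activation differences can be magnified.

```python
import math

def softmax(net_inputs, gain=2.0):
    """Soft-max over the word output units: exponentiate and normalize.
    A gain above 1 magnifies activation differences, pushing the
    network toward a single winning word unit."""
    scaled = [gain * x for x in net_inputs]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With `gain=1.0` this is the standard soft-max; raising the gain makes the most active word unit dominate, which helps the network commit to one word at each step.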
Production-SRN Model

The Production Simple Recurrent Network (Prod-SRN, Figure 1) was based on Elman's (1990) simple recurrent network, which mapped from each word to the next word in the sequence. Because one way to view the model is that the output is production and the input is comprehension, the input units will be prefixed with the letter “c”. This model had a set of units that represented the input word, the cword units, which projected into the hidden units and then to the output word units. The hidden units copied their state to the context units, which averaged their previous state and the new input from the hidden units to create their new state. The context units projected back to the hidden units, allowing the model to learn sequential dependencies by retaining important state information in the context units.

Because production involves planning a particular intended sequence (as opposed to sequence prediction), the Prod-SRN included a message. The message was connected to the SRN hidden units, which allowed the model to use the message to guide sequence generation. The message representation used binding by space (Chang, Dell, Bock, & Griffin, 2000; Dell, Chang, & Griffin, 1999). That is, different event roles were represented by different banks of units. The message had slots for each role (agent, patient, goal) and a slot for the action. Each of the roles had a localist semantic representation: a unit for the meaning of dog in the agent slot (e.g., "the dog chased the cat") and a separate unit for dog in the patient slot (e.g., "the cat chased the dog"). Each action was represented by a unique action feature. The construction features were also included in the message, in the action slot. Table 3 is an example message for the sentence “A man baked a cake for the café”.
Because this message has a separate set of semantic features for each slot, the features in each slot are labeled with a number to show that they are different from the same feature in another slot (e.g., CAKE1, CAKE2, CAKE3). The action slot did not overlap with any other slots, so those features are not given the extra number index. In this message, there are construction features (CAUSE, CREATE, TRANSFER) and a verb-specific feature (BAKE). Definite articles (“the”) were marked with a slot-specific feature (e.g., DET3). Indefinite articles (“a”) were not marked; because indefinite articles cannot occur with mass lexical items (e.g., coffee), leaving them unmarked made their use more dependent on the semantic category. The output of the model was a localist representation of the words in the lexicon. The hidden layer and context layer were 50 units each, and the context units were initialized to 0.5 at the beginning of each sentence.

Dual-path Model

The Dual-path model was designed to generalize symbolically, and hence it differed substantially from the Prod-SRN model. In language production, symbolic generalization is exhibited by placing words in novel sentence positions. If you learned a new word, you could use this word in a variety of frames. To get this word-based generalization, the mapping from lexical semantics to word forms should be the same regardless of where the word occurs in the sentence. Capturing both lexical and sentence-level aspects of words is similar to a problem in the visual system of both categorizing an object and recording its position in a scene (Landau & Jackendoff, 1993). The process of object categorization must remove location-specific information and transform the object to take into account the point of view of the viewer, in order to get an invariant representation that can be used for categorization (see Kosslyn, 1994, for a review).
The process of locating an object, on the other hand, does not need to concern itself with the identity of the object in order to determine its position in space. These two functions have been identified with separate brain structures, the what (object) and where (location) pathways (Mishkin & Ungerleider, 1982). These two separate representations have to be bound to each other in order to know which object occurs in which location. The resulting system can recognize known objects in new locations and identify the location of unfamiliar objects. That is, it generalizes well. And it does so because of the separation (and binding) of the object and location information.

Just as the visual system can generalize in different ways because it has separate what and where representations, a model of sentence production should be able to generalize well if it represents its message in several separate representations that are linked together. That idea was the basis for the Dual-path model. This architecture had two pathways, one for representing the mapping of object semantics to word forms, and another for representing and mapping objects (and the words that describe them) into appropriate sentence positions (Figure 2).

The first pathway of the model was the message-lexical system. This subnetwork was a feed-forward network from the message to the lexicon. The message in this model was represented in weight bindings between a layer of where units (thematic) and a layer of what units (semantic). By using this type of representation, the same what units could represent the meaning of a word regardless of its event role. The where units represented the agent, patient, and goal event roles, and another unit represented action information. The what units represented the semantics of words using the same localist representation that was used for the Prod-SRN.
Messages in this model were represented by setting the weight between the where units and the what units to an arbitrary “on” value (see appendix for details). For example, if a dog was the agent, then the agent where unit would be connected to the DOG feature in the what units. If the dog was the patient, then the patient where unit would be connected to the same DOG feature. In this way, we could represent the different roles of dog in these events, while maintaining the common semantics that all dogs share.

Where do messages come from? In the model, the messages were simply set before production begins. In people, messages can come directly from external visual inputs (in picture description), or they can be generated from internal representations (as in recall or discourse tasks). Kosslyn (1994) has used a variety of methodologies, such as neuropsychological studies, psychology experiments, and computational modeling, to argue that internally and externally generated visual representations are both represented and operated on by the same systems. Since we can describe these visual images using our language production system, the language system must be linked to the visual systems, and these visual images must constitute a type of message representation.

So what about non-visual meanings? It has long been argued that abstract meanings often derive from more concrete meanings (Barsalou, 1999; Lakoff, 1987). For example, “The White House” is both a concrete physical location and an abstract label for the office of the presidency, as in “The White House is not saying anything”. Abstract relations are also derivable from visual representations. For example, the preposition “to” can represent the physical endpoint of a path of motion, as in “go to the store”. But this same preposition is also used to represent abstract transfer even if that transfer does not require motion, as in “tell the story to the children”.
In this sentence, the story undergoes abstract motion between the mind of the storyteller and the children. If abstract meanings often come from concrete meanings, and concrete meanings often have a visual component, then it seems reasonable to assume that conceptual representations are elaborations of the output of the visual system, or at the very least tightly linked to that system. Landau and Jackendoff (1993) suggest that visual, auditory, and haptic information come together to form a spatial representation, and this spatial representation is the input to the language system. They hypothesize that a key feature of this spatial representation is that it encodes the distinction between objects (“what”) and locations (“where”), and that this distinction is related to differences between nouns (encoding objects) and prepositions (encoding locations). Together with the evidence from the imagery literature, it seems reasonable to conclude that speakers can generate messages from external stimulation or internal generation, and that these messages, which can represent concrete or abstract concepts, are represented in a form that reflects the distinction in the visual system between objects and locations.

The other part of the message-lexical system was the connection between the what units and the word units. This allowed the model to learn a word label for each meaning. Since there was only one set of what units (unlike the Prod-SRN model, which duplicated the analogous units for each thematic role), learning the mapping of the semantic feature DOG to the word “dog” allowed the model to generalize this word to other event roles. By using a single semantic/lexical network, this model is more in line with mainstream lexical production theories (Dell, 1986; Levelt, Roelofs, & Meyer, 1999) than the message-lexical mapping of the Prod-SRN model.
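The weight-based binding of what and where units described in this section can be sketched as a small table of fast weights. This is an illustrative sketch only: the role names come from the text, but the `ON` value, the feature inventory, and the helper functions are placeholders (the actual weight value is given in the paper's appendix).

```python
ON = 1.0  # arbitrary "on" weight; the real value is specified in the appendix

ROLES = ["agent", "patient", "goal", "action"]
WHAT_UNITS = ["DOG", "CAT", "MAN", "CAKE", "CHASE"]

def make_message(bindings):
    """Represent a message as where-to-what weight bindings, e.g.
    {"agent": "DOG", "patient": "CAT", "action": "CHASE"}."""
    weights = {(role, feat): 0.0 for role in ROLES for feat in WHAT_UNITS}
    for role, feat in bindings.items():
        weights[(role, feat)] = ON
    return weights

def what_activation(weights, role):
    """Activating a single where unit retrieves the semantics bound to
    that role; the same DOG what unit serves every role it is bound to."""
    return {feat: weights[(role, feat)] for feat in WHAT_UNITS}
```

Because DOG is one shared what unit, the word form learned for DOG as an agent carries over directly when DOG is bound to the patient or goal role, which is the source of the model's role generalization.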
The second pathway of the network, the sequencing subnetwork, was a simple recurrent network with a few other inputs (Figure 2). The network mapped from the cword units to a hidden layer. The hidden layer received input from context units, which, as in the Prod-SRN model, held a history of their previous states as well as a copy of the previous hidden unit states. The hidden layer then mapped to the same word units that were used in the message-lexical system. Like Elman's (1990) model, in between the lexical layers (cword and word) and the hidden units were several compression layers, which helped the sequencing network to create generalizations over words, rather than word-specific representations.

The sequencing subnetwork also received input from a lexical-comprehension subnetwork that was just a reversed version of the message-lexical network. Without this subnetwork, the model would not be able to vary its sentence structures based on the previous word that had been produced. If you said "the cat", it could be the beginning of the sentence "the cat chased the dog" or of "the cat was chased by the dog". Without knowing what role "cat" plays in your message, you do not know whether to continue the sentence with an active or passive structure. The reverse message-lexical network tells the sequencing network the role of the last word that the model produced, which allowed it to dynamically adjust the rest of the sentence to match the beginning. Because the mapping from words to their meanings is exactly the process that is supposed to go on in comprehension, these units are labeled cword, cwhat, and cwhere. In order for the model to use the cwhat-cwhere units, it had to learn the mapping between cword and cwhat. That is, it had to learn the meaning of each word in the comprehension direction.
Because the error signal from the word units, that is, the produced word, was backpropagated along the weights in the network, its effects were weakened as it went further back in the network. The error signal from outputted words was not sufficient to learn the cword to cwhat mapping in a way that would help the overall learning of production. Therefore, to help these units learn, the cwhat units were provided with the previous what units’ activation as target activations. But since, initially, the model had not yet learned to control the where units, it did not have very good targets to give to the cwhat system. What happened was that the model bootstrapped itself into learning words. For example, suppose the model was learning a sentence where the agent was a cat. At the beginning of training, the network had random weights. To get an error signal to the cwhat units so that the model could learn that the cword unit “cat” should be linked to the cwhat unit CAT, the model needed to activate the what unit CAT by activating the agent unit in the where layer. But because the network had random weights, the model could not initially activate the agent unit enough to provide error for the cword-cwhat links to be learned. But slowly, as the model learned to activate the where units appropriately, the what unit activations became more distinctive, and more error was passed back to the cwhat units. Intuitively, the model learned to comprehend word meanings by predicting how it would describe the same situation with its own production system (even if it was not able to produce the appropriate output sequence). There were a few other details about the Dual-path architecture. The hidden layer in the Dual-path model was smaller than in the Prod-SRN model (20 units instead of 50 units), because in this architecture, the hidden layer did not have the difficult task of mapping all the message elements into words.
The cwhere units were soft-max units, which forced these units to choose one winner and to reduce the activation of competitors. To help the model to remember what event roles had already been produced, the model also had a set of context units called cwhere2. The cwhere2 units received half of their activation from the previous cwhere states, and half of their activation from the previous cwhere2 states. Because the cwhere units were strongly biased to represent the present role of the cword input, the cwhere2 units helped the model to record the history of roles that the model had gone through. The hidden units also received inputs from a set of construction units that held the construction features. These features helped the sequencing system to create appropriate sequences that reflected the message for particular constructions. The functionality of the construction features will be examined by comparing the Dual-path model to the No-construction model, an otherwise identical model which lacked these features. Table 4 shows how the Dual-path model would represent the example message that was used earlier (“A man baked a cake for the café.”). Because this model, unlike the Prod-SRN model, only used one set of semantic features, the features were not indexed with a number. The construction layer held the construction-specific features. There were two points where the message-lexical and the sequencing systems interacted. One point was a connection from the hidden units of the sequencing system to the where units of the message-lexical system. This allowed the model to sequence the where units, and enabled it to produce message-related words in appropriate places. But because the sequencing network did not have access to the message, it tended to develop representations that were independent of the message content. That is, its representations tended to be syntactic, as I show later in an analysis of the hidden units.
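The cwhere winner-take-all behavior and the cwhere2 history units described above can be sketched as follows (an illustrative approximation: a standard softmax stands in for the soft-max units, and the cwhere2 update is the half-and-half averaging just described):

```python
import math

def softmax(xs):
    # soft-max picks one winner by boosting the largest input and
    # suppressing competitors; activations sum to 1
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def update_cwhere2(cwhere_prev, cwhere2_prev):
    # cwhere2 takes half its activation from the previous cwhere state and
    # half from its own previous state, so recently produced roles leave a
    # decaying trace of the role history
    return [0.5 * a + 0.5 * b for a, b in zip(cwhere_prev, cwhere2_prev)]
```

For instance, after an agent and then a patient have been produced, the agent trace in cwhere2 has decayed to 0.25 while the more recent patient trace sits at 0.5, giving the sequencing system a graded record of the roles already covered.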
The second point of interaction between these two systems was the word units. Here the message-lexical system activated meaning-related possibilities, and the sequencing system activated syntactically appropriate possibilities. The intersecting activation from these two sources enabled the production of message-appropriate words at the proper positions in sentences. The use of separate networks for each mapping was consistent with work in sentence production that showed that lexical-semantic factors and syntactic factors have independent effects on sentence structures (Bock, 1987; Bock, 1990; Bock, Loebell, & Morey, 1992).

No-construction Network

An interesting problem for the Dual-path architecture was that in order to learn syntactic representations, the message had to be isolated from the sequence learning system. Otherwise, connectionist learning procedures would cause the sequence system to use the rich message representation to optimize its sequential states, and these states would come to have message-specific semantics, instead of being more syntactic. But by connecting the construction features to the sequence system, the sequencing representations could learn more construction-specific representations. For example, the model might learn that subjects of different constructions have different properties. The subjects of “go” tend to be mobile, while the subjects of “give” tend to be animate and have a grip of some sort. If this is the case, the model might not generalize subjects of “go” to the subject of “give”, because of these distributional profiles. So one possibility that is worth examining is that a model without these constructional features would be able to generalize more widely, because it would be free to generalize in a purely syntactic fashion, independent of any meaning. That is, it would treat the subjects of “go” the same as the subjects of “give”.
This possibility was examined by training a model that was identical to the Dual-path model, except that the construction units were disabled. The sequencing subnetwork only had information about the previous lexical item and the role of that item in the message (cwhere and cwhere2). As in the Dual-path model, the sequencing subnetwork had compress units that reduced the information that could be linked to the cword and word units. This made this subnetwork more likely to develop representations based on syntactic class rather than lexical items. Without the construction units, this subnetwork representation should be even more syntactic, because it does not have a particular target construction in mind. By comparing this No-construction model with the Dual-path model, we can evaluate the utility of just having relatively pure syntactic structures, in the absence of any construction-specific knowledge about when to use them.

Linked-path Network

One important claim about the Dual-path model is that the separation of the two pathways is critical to getting the appropriate type of generalization. In particular, the separation of the message-lexical system from the sequencing system keeps lexical semantics (what units) from influencing syntactic structures. A clearer test of this hypothesis is to compare the Dual-path model with a model that allows the what units to influence the sequencing system. This model, which I call the Linked-path model, was exactly the same as the Dual-path model, except that it had an extra link from the what units in the message-lexical system to the hidden units in the sequencing system. This link removed the separation between the two systems, and allowed the hidden units to develop representations based on the features that were active in the what representation.
Since these representations were highly predictive of the target sentence, it is likely that the hidden unit representations would make use of this information to increase their accuracy, and that should make those representations more specific to the sentences that are used in training, and less able to deal with sentences that have not been trained.

Experiments

Four different training sets (501 sentences each) were created using different random seeds. For each of these sets, the Prod-SRN, Dual-path, No-construction, and Linked-path models were trained for 4000 epochs. This amount of training resulted in good accuracy within a reasonable amount of time. On analogy with human subjects, the label model subject will be used to refer to differences that are due to a particular training set. So each model type had four model subjects, yielding a total of 16 models. Each model was tested on its own training set and the same 2000 randomly generated test sentences. The dependent measure was the percentage of sentences that matched the target sentences.

Results: All of the models achieved higher than 98% accuracy on the training set in 4000 epochs (Figure 3). The Dual-path and Linked-path models reached 99% accuracy at 1400 epochs. The No-construction model peaked at around 97% after 2400 epochs. The Prod-SRN model rose slowly, eventually reaching 99% at 3400 epochs. These results show that all four architectures were able to learn the training set within a reasonable amount of time. To test overall generalization, I looked at the accuracy on the set of 2000 test sentences generated randomly from the grammar (Figure 4). On these test sentences, the differences among the architectures were evident. The Prod-SRN model never generalized very well. Even as the training accuracy reached 99%, the testing accuracy maxed out at 12%. The No-construction model did better, reaching a final accuracy of about 52%.
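The dependent measure, exact match of the whole produced sentence against the target, can be computed as in this brief sketch (the function name is an illustrative assumption):

```python
def sentence_accuracy(produced, targets):
    """Percentage of sentences whose produced words all match the target
    sentence exactly -- the dependent measure used in these simulations."""
    exact = sum(1 for p, t in zip(produced, targets) if p == t)
    return 100.0 * exact / len(targets)
```

A sentence counts as correct only if every word matches, so a single wrong word anywhere in the output makes the whole sentence count as an error.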
The Dual-path and Linked-path models jumped above 70% after 1200 epochs (just as the models were reaching the maximum accuracy on the training set). Here the two diverged, and the Dual-path model reached 79% while the Linked-path model fell to 68%. An analysis of variance (ANOVA) was performed on the accuracy at epoch 4000 for all four model types with training set as the random factor. Model type was significant (F(3,12) = 74.4, p < 0.001). Pairwise comparisons were performed between the different model types, and all differences were significant except the difference between the Dual-path model and the Linked-path model. The large differences in the generalization abilities at epoch 4000, when the training accuracy was approximately the same, suggest that the architecture plays a crucial role in a model’s ability to generalize. Another point to notice is that the Dual-path model did not lose its generalization ability after it reached 99% accuracy on the training set. Instead, the model continued to improve, going from 76% at epoch 1400 to 79% at epoch 4000. So the Dual-path model seems to avoid overfitting the training set. Overfitting is a problem for generalization: normally, the better adapted a model is to the particular characteristics of the training data, the worse it becomes at dealing with new data. The Linked-path model may suffer from overfitting of the training set, because at epoch 1400, its testing set accuracy asymptoted and began to decline. The difference in sentence accuracy between epochs 1400 and 4000 was computed for all model subjects in each model type. The mean difference was negative for the Linked-path model (-0.02), while it was positive for the other models (Prod-SRN = 0.03, No-construction = 0.01, Dual-path = 0.02) (F(3,12) = 3.5, p < 0.05). Pairwise comparisons revealed that the Linked-path model was different from the Prod-SRN and Dual-path models.
The fact that the Dual-path and Linked-path models both maintained the same level of accuracy on the training set for this period, but the Dual-path model continued to improve while the Linked-path model got worse, suggests that overfitting is a problem for the Linked-path model. Because its hidden units had a link from the message (in the what units), the Linked-path model could get better at the particular messages in the training set, which might then reduce the model’s ability to generalize. The Dual-path model avoided overfitting because its isolation of lexical semantics and sequencing kept message-specific knowledge from reducing generalization.

Dog-goal Experiment

One part of symbolic generalization is the ability to bind words to novel event roles, and to generate sentences that convey those novel meanings. For training of the models, the grammar was designed so that the word “dog” was never allowed to be the goal of the sentence. By giving the model messages where the goal was bound to DOG, we could see whether the model could generalize its experience with other goals to produce these novel sentences correctly. One hundred test sentences were randomly generated with “dog” in the goal slot. All four model subjects for each of the four model architectures were tested on this dog-goal test set. The dependent measure for this analysis was the percentage of sentences for which all the words matched the target sentence exactly, or the overall sentence accuracy. The Prod-SRN model produced 6% of the dog-goal sentences correctly (Figure 5). The No-construction model did better, producing 55% of these sentences correctly. The Linked-path and Dual-path models both generalized well, with the Dual-path model at a higher level (82%) than the Linked-path model (67%). An ANOVA was performed, and model type was significant (F(3,12) = 105.6). Pairwise comparisons showed that all differences were significant.
The dog-goal test helps to explain why the Dual-path and Linked-path models were better than the other two models in the overall generalization test using 2000 sentences. While all the models achieved good accuracy on the sentences that they were trained on, these novel sentences had words in roles that had not been trained before. The Dual-path and Linked-path models derived some benefit from the dual-pathway architecture, which allowed the semantics for words in different roles to be in the same semantic layer (what units). So once any of these models learned to say “dog”, there was a link from the semantic unit DOG to the word unit “dog”, and this allowed the word to be said in different sentence positions. The Prod-SRN used a binding-by-space message representation, where different roles had their own set of semantic units. In this model, the semantics for “dog” in the goal slot (DOG3) had never been trained, and so it could not generalize that word to a novel position. It also seems that construction information is crucial, because the Dual-path model was better than the No-construction model. Construction information helped these models sequence the GOAL where unit at the time when the goal should be produced. And the difference between the Dual-path and Linked-path models suggests that the Linked-path model was including lexical semantics in its dative syntactic representations, and that hurt its ability to produce novel dog-goal sentences. So the dog-goal test showed how slot-independent lexical mapping, construction information, and abstract syntactic frames work together to produce symbolic generalization.

Identity Construction Experiment

The dog-goal experiment was a good test of the ability of the models to generalize to a novel sentence position.
But this test might inflate the generalization abilities of the models, because the goal often occurred at the end of the sentence in many constructions, and so we do not know whether the model would be able to continue to generate structure after producing a word in a novel position. Also, it could be that accidental distributional properties of the dative construction were influencing generalization. Consequently, another test was carried out. Inspired by Marcus’s (1998) claim about the inability of SRNs to produce novel sentences like “a blicket is a blicket”, this identity construction was used to see how well these models generalize. This identity test takes advantage of the accidental fact that in the random generation of the training set, only a subset of sentences used the identity construction. (Recall that existence and intransitive verbs were less frequent than other constructions in training.) Novel identity construction sentences were randomly generated for each model subject (the actual number varied between 48 and 58). The four model subjects for each model architecture were tested on these sentences at epoch 4000 (Figure 6). One model subject did not produce any correct identity constructions for any of the model types (the training set had only 2 identity construction sentences, and therefore it probably did not have enough experience to learn this construction), and so this model subject was excluded. Looking at the other three model subjects, the difference between the Dual-path model and the other three models was quite dramatic. The Dual-path model had an 88% accuracy, while the other models did not get above 44%. An ANOVA was performed on the remaining three model subjects, and model type was significant (F(3,8) = 11.2, p < 0.005). Pairwise comparisons found that the Dual-path model was superior to the Prod-SRN and No-construction models, but not to the Linked-path model.
The identity construction test showed that the Dual-path model’s generalization capabilities are broad, applying to different words in multiple sentence positions. In the model, the identity construction used the patient where unit (as a default role) to instantiate the single argument of this construction. All the models had experience mapping patients to both pre- and post-verbal sentence positions with other constructions, which probably allowed models like the Prod-SRN and No-construction models to produce some of these sentences. But this ability was increased when the model had the Dual-path architecture with construction information, as in the Dual-path and Linked-path models. This is partially due to the existence of the construction feature EXIST, which signaled the identity construction. The architecture allowed the model to create a rule where the patient unit was treated as a variable: EXIST -> PATIENT is PATIENT. By sequencing the components of this rule, the model was able to generalize to sentences that had patients that had never occurred in this construction in training. While the identity construction test is similar to Marcus’s (1998) “A blicket is a ...” test, it differs in that the models being compared here had previous experience placing the nouns into both surface positions. In Marcus’s test, a novel word “blicket” is used; the human does not have experience placing the novel word in any sentence position. But even though the models had previous experience placing nouns into these positions, the Prod-SRN and the No-construction models could not make use of this experience to increase their generalization. Only models like the Dual-path model were able to get this right.

Novel Adjective Pairing Test

Both of the two previous generalization tests involved placing a noun in a position that it was not trained in. A contrasting question is whether novel noun-phrase-internal sequences can be generated.
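The variable-based rule EXIST -> PATIENT is PATIENT can be illustrated with a small sketch (a hypothetical function; the model of course realizes this behavior implicitly in its weights, not as an explicit symbolic rule):

```python
def produce_identity(bindings):
    """Expand the identity construction by filling the PATIENT variable.
    Because the role is a variable, any binding generalizes, including
    words never seen in this construction during training."""
    patient = bindings["PATIENT"]
    return ["a", patient, "is", "a", patient]
```

Bound to a noun that never appeared in this construction during training, the rule still yields the correct output, e.g. "a blicket is a blicket".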
The next test examined this question by exploiting restrictions on adjectives in the training grammar. The grammar restricted adjectives so that they could only pair with appropriate nouns. There were two kinds of adjectives, those that were restricted to animate entities (nice, silly, funny, loud, quiet) and those that were not restricted (good, red, blue, pretty, young, old). So dogs could be nice, but cakes could not be nice. But both cakes and dogs could be old or good. In all the previous training and test sets, these restrictions were enforced. People, however, can make metaphorical extensions of animate adjectives to inanimate elements. For example, a car can be “nice” if it is easy to maintain. Or a wall can be “silly” if it is painted in a crazy fashion. A good test of a model’s ability to generalize symbolically would be to see if it can produce “nice car” or “silly wall” when the message calls for it. One hundred test sentences were randomly generated with animate adjectives attached to inanimate nouns. All four model subjects for each model type were tested on this novel-adjective test set at epoch 4000 (Figure 7). Again, the Dual-path model was best, producing 73% of these sentences correctly. The Linked-path model produced 53%, the No-construction model produced only 32% correct, and the Prod-SRN was worst at 2% correct (F(3,16) = 37.3, p < 0.001; pairwise comparisons found that all differences were significant except for the difference between the Linked-path and Dual-path models). The ability of the Dual-path model to generalize better than the other models in this case is not primarily due to the what-where system, as in the previous two tests. In the earlier tests, if the model produced the appropriate where unit at the right time, then it would have a good chance of generalizing appropriately.
In this case, both adjective and noun semantics (what units) were connected to the same where unit, so sequencing the where system appropriately was not enough to generalize appropriately. Rather, the model had to develop a way to sequence words within a phrase in a symbolic manner. For this to occur, the model had to get two things sequenced appropriately. First, the appropriate where unit for the phrase had to be activated. The No-construction model probably did not activate this where unit appropriately, because it did not know anything about the message. The second part was to sequence the words within a phrase without reference to their co-occurrence frequency. The Prod-SRN model should record lexical-specific co-occurrence frequencies, because its hidden units have access to the semantics of the whole phrase, and so they will prefer that these adjectives be followed by animate nouns. The Dual-path and Linked-path models were able to meet both requirements for producing these novel phrases. Their constructional knowledge helped them activate the right where unit at the right time. And the compress units in the sequencing system kept the models from recording lexical-specific co-occurrence frequencies. The fact that these models can do this metaphorical extension suggests that the model has developed a symbolic ability to sequence words within phrases, in addition to its ability to sequence phrases within a sentence.

Conclusions about Symbolic Generalization in Different Architectures

Why was the Dual-path model better at generalization than the other three models? There were three dimensions that were manipulated in these experiments. One was the message type. The Prod-SRN had a slot-based message, while the other three models had the what-where message representation. The what-where message allowed those models to learn syntactic structures which used the where units to activate variable information in the links.
Over all the comparisons, the models with the what-where message were better than the Prod-SRN at generalizing, and so it seems that there is a definite benefit to using this type of variable representation. The second dimension was the architecture of the network. The issue was whether a separation between the message and the syntactic representations was needed to achieve good generalization. This comparison can be seen in the differences between the Dual-path network and the Linked-path network. These networks were equivalent except that the Linked-path network linked the two pathways, and this allowed the syntactic representations to use information about the message that was being produced. The Dual-path model was clearly better than the Linked-path model in the magnitude of all generalization measures (significantly better in two comparisons). The Linked-path/Dual-path comparison makes an important point about the architecture of the Dual-path model. One could say that the higher performance of the Dual-path model is not due to its architecture, but rather is a result of having more connections than the Prod-SRN or the No-construction models. But the Linked-path model had even more connections than the Dual-path model, and the fact that it did worse suggests that the Dual-path architecture is well suited to the task of symbolic generalization. The third dimension that was manipulated was the presence or absence of construction information. While it is true that the Dual-path model did better than the No-construction model on all measures, its higher performance cannot be solely due to the existence of construction features. Recall that the Prod-SRN model’s message representation also had these construction features.
In the Prod-SRN model, construction information could help the learning of sequencing constraints, but the model had to learn the individual combinatorial relationship between the construction information in its message and each of the semantic-lexical mappings separately. In the Dual-path architecture, the construction information was connected to an SRN that was blind to the message and able to sequence variables through the where units. So in this network, the value of the construction information was increased, because it was used to sequence variables within abstract frames. Given that this paper addresses the limitations of connectionist models that Marcus (1998) points out, it is worthwhile to frame the model comparison in terms of his notion of a training space. Marcus argues that connectionist models do not generalize beyond their training space. The training space is the set of input feature values that have been experienced during training. These input feature values have associated output outcomes, and so the model can interpolate between input values to find interpolated output values. But outside of the training space, these models cannot extrapolate to find appropriate output values. While all four model types received the same training set within a model subject, the architecture of the models created different training spaces for each model. The Prod-SRN model has a message where each role occupies different units. That means that its training space is role-dependent: each word’s semantics has to be trained in a particular role to generalize appropriately. In the other three models, there is only one set of what units that represents lexical semantics for all the event roles. If the model learns to produce a word correctly, then that word’s semantics is in the training space.
And because the Linked-path and Dual-path models have the construction features, the ability to produce one construction correctly allows other sentences in that construction to be in the training space. The problem with the Linked-path model is that the sequence representations that it uses are contaminated with lexical-semantic information, because of the link between the message-lexical system and the sequencing system. The Dual-path model overcomes this limitation by isolating these two systems, forcing the sequencing system to use only a limited number of syntactic categories to make the distinctions that will be useful. Because the training space of the Dual-path model is divided into constructions (which operate on syntactic categories and variables) and lexical-semantic representations (which select words), it is able to generalize in a manner that is similar to humans.

Hidden Unit Analysis

In order to understand how each pathway in the Dual-path model works, it is valuable to examine the activation of the units in the model as they process sentences. It is useful to look at the compress units in order to understand the sequencing system, because these units directly influence the production of words, and so the lexical effects of the sequencing system must be propagated through this layer. To see how the message-lexical system works, it is not as useful to look at the what unit activations, because the what units depend on the message-specific where-what links. So I looked instead at the activation of the where units, which were message-independent. While the model was tested on the 2000-sentence test set, the activation of the where and the output compress units was recorded. There were 4 where units (agent, patient, goal, action) and 10 compress units. Given the previous generalization results, we should expect different types of representations in these units in different situations.
The average activation of these nodes for one model subject when tested on the 2000 novel sentences was calculated, and the results were quantized into 5 distinct levels, to make the similarities between units more evident (Table 6). Because of this averaging, the most diagnostic information comes from the strongly activated units (dark elements in the table), because the less activated units could reflect the averaging of strong and weak elements over different sentences. The activation of these nodes was averaged by syntactic class. First consider the compress units, which represented the output of the sequencing system. Here, the goal was to test the claim that these units represent syntactic-like states in the model. While the activation was quite distributed, there were some clear patterns. Verbs mainly used units C1, C2, C4, C5, and C10 to activate the appropriate verb. C6 and C8 seemed to be more specialized for phrasal elements like nouns, adjectives, prepositions, and determiners. Nouns and adjectives shared the same units, except that nouns also activated C4. Determiners and prepositions both had many units activated. Auxiliaries and intransitive verbs shared C3 and C5. Syntactic categories that had more activated units in this layer depended more on the sequencing system, and therefore closed-class words seemed to depend on this pathway more than open-class elements. Thus, the compress units seemed to capture important syntactic knowledge in the model in the form of syntactic category distinctions and verb-subcategorization information. The where representations for different syntactic categories tell a different story (Table 6, right side). The where representations did not differentiate syntactic categories strongly. The action role (AC) was active for verbs, prepositions, determiners, and nouns.
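The quantization of averaged activations into 5 levels can be done as in this sketch (the exact cut-points used for Table 6 are not specified, so the equal-width bins here are an assumption):

```python
def quantize(activations, levels=5):
    """Map average activations in [0, 1] onto discrete levels 0..levels-1
    using equal-width bins (an assumed binning scheme, for illustration)."""
    return [min(int(a * levels), levels - 1) for a in activations]
```

With 5 equal-width bins, an average activation of 0.99 and a ceiling activation of 1.0 both fall into the darkest level.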
The patient role (PA) was active for determiners, nouns, adjectives, prepositions, auxiliaries, and intransitive verbs, and the agent role (AG) had a similar pattern except that prepositions were not activated by this role. The goal role (GL) was active for prepositions only. So syntactic categories were not distinguished strongly by these units. Also, subcategorization distinctions were not maintained in these units. So it would seem that the sequencing system, rather than the what-where system, was responsible for most of the syntactic behavior of the model. To understand how the where units influence processing, a second analysis was done, looking at the activation of these units given a particular sequence of syntactic categories. In English, sequences of syntactic categories encode role information, and so the where units should be more clearly differentiated when separated by the preceding sequence. In Table 7, the average activation of units is given for a syntactic category in a particular sequence (marked in bold). For example, if we had a sentence “The man gave a cake to the cat”, the state of the compress and where units would be recorded for each word in the sentence. These states were averaged over many sentences with similar sequences of syntactic categories to get an average “DET NOUN VDAT DET NOUN PREP DET NOUN” state representation, and this was placed into Table 7. At the end of the table, the average activation for a prepositional dative sentence with the word “dog” in the prepositional phrase was appended for comparison with other nouns. Two other lines were also appended to show the average activation for prepositional phrases with an adjective. The states came from the sentences in the 2000 novel sentences test set at epoch 4000 in a single Dual-path model. As the model produced the sentences, the activation of the where units tracked the phrases that the model was producing.
As it produced the subject "DET N" (e.g., the man), the agent (AG) unit was activated strongly, and it turned off as action (AC) was activated and the dative verb (VDAT) (e.g., give) was produced. The next noun phrase started with both patient (PA) and goal (GL) activated, but as the patient phrase won out (e.g., the cake), the GL unit was shut off. This demonstrated the incremental nature of the model's decisions about structure selection, because if it had planned the sentence structure earlier, GL would have been deactivated from the beginning of the production of the patient phrase. Once the patient phrase and the preposition were produced, the model produced the goal phrase by activating the GL unit (e.g., "the cat"). This GL unit stayed activated through the whole phrase. Because the lexical-semantic information was embedded in where-what links, the sequential activation of where units is exactly the behavior that we would expect in order to extract this information at the appropriate moment. There is some independent support for thinking of sentence production as a process that involves moving attention over event roles, or the sequential activation and deactivation of roles. Griffin and Bock (2000) found that when speakers describe pictures of events, they tend to fixate on the picture elements right before naming them in their sentences. They found that this fixation depended on the syntactic structure that speakers actually used, which suggested that the syntactic structure was influencing eye movements. Production theories with static message representations (e.g., Chang, Dell, Bock, & Griffin, 2000) would not predict that eye movements would be so synchronized with structural decisions. The Dual-path model, however, used message representations that were spatially represented in where-what links, and which were dynamically activated during production.
During event description, activation of the appropriate where unit might be related to focusing attention on elements in a scene. If this is the case, then both structural decisions and eye movements would be related to the activation of where units, and this provides a reason why syntax should affect visual processing. In the goal phrase with an adjective "DET ADJ N" (e.g., the red cat), there was more evidence for the incrementality of the model's decision-making processes. The activation of the compress units and where units was identical for the production of the noun in the sequence "...PREP DET N" (e.g., cat) and the production of the adjective in the sequence "...PREP DET ADJ N" (e.g., red). What this shows is that the model did not plan to produce either the adjective or the noun specifically at the point after the determiner. Rather, it simply produced the word that was most activated at that point in the sentence. If the adjective was produced, then the model activated C4 and deactivated C1 to produce the noun at the next timestep. If the noun was produced, then the model was done producing the sentence. So the model was incremental at various points in processing, which is consistent with experimental work in language production demonstrating that sentence construction is sensitive to the lexical availability of words at different points in processing (Bock, 1986; Ferreira, 1996). The hidden-unit analysis can also help us understand how the model generalized words to novel positions, as in the dog-goal test. When we look at the activation pattern for nouns that have been trained ("...PREP DET N"), it is no different from the pattern for nouns that have never appeared in that structure ("...PREP DET DOG"). The model learned to treat this novel message in a way that was identical to the other messages in this construction.
Because of the architecture of the Dual-path network, this ability was due to the construction features, in concert with the cwhere information, activating the goal unit at the appropriate moment in the sentence. The mapping from construction units to where units was not novel (it was shared with all dative sentences), so the model could sequence any word that was attached to the goal unit. The equivalent mapping in the Prod-SRN involved mapping from role-specific semantic units like DOG3 in the goal slot to the appropriate sentence position. DOG1 and DOG2 had been trained before, but DOG3 had never been used before, and that was why the Prod-SRN model failed to generalize properly. DOG3 was not in the training space of the Prod-SRN model, because it was not explicitly trained. The hidden-unit analysis tells us several things about the Dual-path model. One is that syntactic categories are represented primarily in the sequencing system, while the activation of the where system seems to reflect the target phrase that is being produced. Processing in the model is incremental, and this incrementality can be seen in the way that lexical factors influence structure selection, and the way that spatial attention tracks the production of sentences. And the model seems to treat novel sentences in a way that is identical to the way that other sentences in that construction are treated.

Constraining Overgeneralization: Baker's Paradox

In the previous experiments, the emphasis was on showing that the Dual-path model had the ability to generalize in a symbolic fashion. But because this model used connectionist learning mechanisms to develop its internal representations, it should also represent statistical regularities at all processing levels, and we should be able to see their influence on some aspects of processing. A useful domain in which to look at the role of statistical processing is the way that verbs are paired with structural frames.
Unlike nouns, verbs seem to be more selective about the structures that they can be paired with, and this relationship seems to be statistical, that is, graded. While nouns are easily paired with sentence frames, verbs are less easily associated with frames that they have not been heard in (Tomasello, Akhtar, Dodson, & Rekau, 1997). The problem of constraining verb generalization is a problem for symbolic systems, because verbs and nouns are both controlled by variables. The same mechanism that gives nouns their ability to generalize to different frames might, one would think, also give verbs the same abilities. This property of symbolic systems has led to a learnability problem that was first described by Baker (1979), and which is referred to as Baker's Paradox (Gropen et al., 1989). The paradox arises from the fact that children both seem to be able to overgeneralize a verb to a novel frame and yet are reluctant to do so. This behavior could be explained if children started with a tendency to overgeneralize and gradually learned to constrain that generalization because of negative evidence from their parents. But adults do not give enough detailed direct negative evidence for children to avoid overgeneralization, and so it is a puzzle how children learn to constrain themselves. The Dual-path model implemented symbolic processing in a framework that used a statistical learning algorithm, and so the same questions about verb generalization can be applied to the model. If the Dual-path model is simply a symbolic system, then it is also subject to Baker's Paradox, because the limited size of the training set does not provide enough information to restrict generalization. Specifically, the model never received evidence that verbs could not occur in alternative constructions, so we would expect all verbs to generalize equally well.
If, on the other hand, the model is simply a statistical learning system, then we might expect that verbs would not generalize to novel frames, because these novel pairings have a statistical frequency of zero. But if the Dual-path model employs the right mix of these symbolic and statistical properties, it should exhibit properly constrained generalization and thus solve Baker's Paradox. To examine Baker's Paradox in the model, thirty messages that would produce double object dative structures using the verb "throw" (e.g., "The boy throw the girl a cup") were generated. Several other lists were created by replacing the action semantics of the "throw" sentences with the verbs "dance", "hit", "chase", "surprise", "pour", and "load". In training, none of these verbs occurred in the double object frame. "Dance" occurred only in the intransitive construction. "Hit", "chase", and "surprise" occurred only in a transitive frame and a benefactive frame. The verb "pour" occurred only in the cause-motion frame. The verb "load" occurred only in the cause-motion frame and the change-of-state frame. To create a double object dative test set, the goal argument was made more prominent (by setting the construction unit TRANSFER to be more active than MOTION), which made the double object the target structure. All four model subjects for each model type were tested with all of these test sets. Figure 8 shows the average sentence accuracy for double object dative target sentences when all seven verb lists were tested in the Dual-path model at epochs 1000, 2000, 3000, and 4000. For the verb "throw", which was trained with double objects, the model achieved a high level of accuracy (above 95%) after 2000 epochs. The other verbs were not trained in this structure. "Pour" generalized well to this structure, above 78% after 2000 epochs. "Load" generalized at 86% at 2000 epochs, but fell to 56% by 4000 epochs.
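The sentence-accuracy measure used in these tests can be sketched as follows, assuming that a sentence counts as correct only when it matches the target word for word; the word-list representation and example sentences are illustrative assumptions.

```python
def sentence_accuracy(produced, targets):
    """Fraction of produced sentences that match their target sentence
    word for word, assuming sentence accuracy is an exact-match measure."""
    correct = sum(1 for p, t in zip(produced, targets) if p == t)
    return correct / len(targets)

# hypothetical double object dative targets and model output
targets = [["the", "boy", "throw", "the", "girl", "a", "cup"],
           ["the", "man", "throw", "the", "dog", "a", "ball"]]
produced = [["the", "boy", "throw", "the", "girl", "a", "cup"],
            ["the", "man", "throw", "a", "ball", "to", "the", "dog"]]

# second production is a prepositional dative, so it does not match
print(sentence_accuracy(produced, targets))  # -> 0.5
```

Under this measure, producing a grammatical prepositional dative when the double object was the target counts as an error, which is exactly what lets the test distinguish verbs that generalize to the double object frame from verbs that fall back on their trained frames.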
The verbs "surprise", "chase", and "hit" generalized above 27% at 1000 epochs, but fell to 5% at 4000 epochs. The verb "dance" never generalized to the double object structure. So even though all the verbs except "throw" had never been trained in the double object structure, there were several different varieties of overgeneralization. Some verbs generalized in a symbolic way (e.g., "pour"), while others generalized in a way that reflected the statistical properties of the construction and that verb (e.g., "dance"). But most of the verbs occupied an intermediate position, where they overgeneralized early in learning and later learned to constrain their generalization. The model's behavior mimics Gropen and colleagues' (1989) experimental data with children. In their third experiment, they taught a novel verb in a neutral frame while demonstrating a transfer action (e.g., "This is norping"). They then tested the child's ability to produce the novel verb in a double object frame (e.g., "You norp me the ball"). They also asked the child to say it using a known dative verb (e.g., "You give me the ball"). They elicited 78% double object responses for verbs that the child knew before the experiment (e.g., give) and 41% double object responses for novel verbs (e.g., norp). At epoch 1000, the model produced 70% double objects for verbs that it had experienced in the double object dative frame ("throw") and 36% double objects for verbs that had never appeared before in this structure (average of chase, dance, hit, load, pour, surprise), which shows that the model can capture the intermediate nature of novel verb-frame generalization. The developmental pattern of the model also resembled the way that generalization changes in children. The model initially was unable to produce any sentences, but as it learned the language, it started to overgeneralize between epoch 1000 and epoch 2000.
This overgeneralization was gradually reduced as the model continued to learn. "Surprise", "chase", "hit", and "load" showed this pattern. The pattern was partially due to differences in the speed with which the two sides of the model learned their corresponding representations. The message-lexical system learned its mapping first, with relatively little constraint from the sequencing system. This allowed the model to overgeneralize. But later, the sequencing system started to build up stronger connections that recorded the statistical regularities of different arguments with different verbs. That knowledge helped to reduce the overgeneralization of the model. These mappings resemble the broad and narrow constraints that Pinker (1989) has argued for. The mapping of the construction units to the where units is similar to the operation of the broad constraints, where the semantics of the whole construction influences the overall order of the arguments. The mapping from the cword units to the word units through the sequencing system represents the operation of the narrow constraints, which involve the way that lexically specific classes restrict the generalization of the construction. The variability in the model between different verbs (dance vs. pour) in the degree of their ability to generalize to the double object dative structure was mainly related to overlap in construction features. "Dance" and "throw" shared no construction features, and so it was very difficult to produce "dance" in a dative frame. "Pour" generalized well to the double object frame, because it shared the construction features CAUSE and MOTION with the dative construction. The available syntactic frames can also influence overgeneralization. For example, "pour" and "load" shared the same features CAUSE and MOTION with "throw". But "load" can also occur in the change-of-state construction (e.g.,
The boy loaded the wagon with hay), where the goal (the entity that undergoes a state change) occurs after the verb. Initially, "load" overgeneralized to the dative as much as "pour" did. But as the model learned to use the change-of-state construction to put the goal after the verb, its ability to use the double object dative was reduced. The change-of-state construction is said to preempt the use of the double object construction as a way of fronting the goal. Preemption, or blocking, is an important way that children reduce overgeneralization (Clark, 1987; Pinker, 1989). Another important reason for variability in the model's generalization was the simplicity of the model's verb representations. Lexical semantic similarity was not captured in the model (e.g., "eat" and "drink" shared no semantic features in their what unit representations), so construction features (e.g., TRANSFER) provided the only reliable information about generalization. In people, verbs cluster into semantic classes that are smaller than the broad classes specified by construction features, and these subclasses are predictive of the syntactic frames that the verbs can appear in (Fisher et al., 1991).

Dissociating Processing Systems in Aphasia

Connectionist models have been used to link our understanding of normal language processing to cases where brain damage has impaired critical processing systems (Dell, Schwartz, Martin, & Saffran, 1997; Plaut, McClelland, Seidenberg, & Patterson, 1996), and these studies have helped us understand how the architecture of a language processing system can influence the type of symptoms that appear in impaired patients. In describing the architecture of the Dual-path model, I have concentrated on the way the architecture enables the model to exhibit certain functional behaviors. But it is also desirable to show that this model really approximates the actual architecture of language in the brain.
This can be done by establishing that damage to the physical architecture of the model leads to symptoms that are similar to those of patients with injury to real brain systems. To do this, lesions were applied to the two separate pathways, and the resulting behavioral effects were recorded. These behavioral effects were compared with aphasic symptoms, to see whether the model's processing abilities were damaged in ways that are similar to patients with brain injuries. Double dissociations in the production of different lexical categories have been an important type of evidence for separate processing systems. Researchers have suggested that function words (prepositions, determiners, auxiliary verbs) and content words (nouns, adjectives, verbs) are represented in separate systems (Goodglass & Kaplan, 1983). Some patients have more difficulty with function words and relatively less difficulty with content words. Meanwhile, other patients have the opposite pattern, with function words being relatively spared and content words being relatively impaired. Other researchers have found that light and heavy verbs also dissociate (Breedin, Saffran, & Schwartz, 1998). Light verbs (such as "go", "give", "have", "do", "get", "make", and "take") are among the most frequent verbs in the speech of children and are the first verbs that children learn cross-linguistically (Clark, 1978). Some aphasic patients have trouble with heavy verbs and are relatively spared with respect to light verbs (Berndt, Haendiges, Mitchum, & Sandson, 1997). Other patients have the reverse pattern (Breedin, Saffran, & Schwartz, 1998). These double dissociations are important because they demonstrate that these behaviors depend on spatially separated processing systems, in that there exists a way to focally lesion each system without automatically impairing the other.
Gordon and Dell (in press) argue that the function-content word dissociation and the light-heavy verb dissociation reflect an underlying distinction between syntactic and semantic representations, with lexical items depending on these separate representations to different degrees. Using a two-layer connectionist model that learned to produce simple sentences, they showed that the model learned representations in which light verbs depended more on the syntactic system and heavy verbs depended more on the semantic system. Since light and heavy verbs both had syntactic and semantic determinants, these dissociations in the model arose out of differences in the degree of dependence on each system. Because the Dual-path model also claims that different types of representations (arising from different pathways) independently influence lexical representation, it is possible to examine whether this model will also exhibit aphasic dissociations. To test the importance of separate pathways in the Dual-path model, lesions were applied to each of the pathways to create two types of impaired models. One model (the what-word lesioned model, abbreviated as the WWL model) was created by lesioning the message-lexical system, specifically the links between the what units and the word units. The other model lesioned the sequencing system by damaging the links between the hidden units and the word units, and was called the hidden-word lesioned model (HWL model). Each of the four Dual-path model subjects was lesioned both ways to create eight models. To increase the power of the statistical analysis, a copy of each of these models was created with a new randomized lesion. These sixteen models were tested on the 2000 novel sentences test set, and the results were coded for analysis. Lesioning the what-word links was more damaging to the network than the same amount of lesioning of the hidden-word links.
To reduce differences due to the overall severity of the lesion, the hidden-word lesion removed 7% of its connections, while the what-word lesion affected only 2% of its links. These lesions led to the same average word accuracy (correct word in the correct position) over the model subjects of 63% for both WWL and HWL models. Table 8 shows some sample output of the two lesioned models and the intended target utterances. Superficially, this sample illustrates some of the differences in the way the lesions influence processing. For example, the HWL model seems to produce content words that are in the target message, while the WWL model produces some non-contextual substitutions ("robin" becomes "church"). The WWL model also seems to omit content words frequently, for example in sentences 2 and 3, or the verbs in sentences 4 and 5. To examine the use of function and content words, the percentage of function words that were correctly produced (as given by the target sentence) and the corresponding percentage of content words were the dependent measures. Function words in the model included prepositions, determiners, and the auxiliary verbs. Content words constituted all other words. As shown in Figure 9, the WWL model produced function words marginally better than the HWL model (82% and 64%, respectively; F(1,14) = 3.97, p < 0.1). The reverse was true for the content words, with the WWL model producing fewer correct words (51%) than the HWL model (64%) (F(1,14) = 5.25, p < 0.05). This double dissociation was a natural outcome of the constraints of the Dual-path architecture. Content words, by definition, have content, or meaning, and so they depend more on the message and the what-word pathway. Function words were only produced in certain syntactic contexts, and so they needed the syntactic information provided by the sequencing system. Lesioning each of these pathways selectively damaged one component, leaving the other relatively spared.
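The lesioning procedure, removing a fixed proportion of a pathway's connections, can be sketched as follows. The 7% (hidden-word) and 2% (what-word) proportions come from the text; the weight-matrix representation, function name, and seeding are illustrative assumptions.

```python
import random

def lesion(weights, proportion, rng=None):
    """Return a copy of a weight matrix with a random `proportion` of its
    connections removed (set to zero), sketching the WWL/HWL lesions.
    The matrix-of-lists representation is an assumption."""
    rng = rng or random.Random(0)
    coords = [(i, j) for i in range(len(weights))
                     for j in range(len(weights[0]))]
    n_cut = int(round(proportion * len(coords)))
    cut = set(rng.sample(coords, n_cut))   # connections chosen for removal
    return [[0.0 if (i, j) in cut else w
             for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

w = [[0.5] * 10 for _ in range(10)]        # 100 hypothetical connections
lesioned = lesion(w, 0.07)                 # HWL-style lesion: cut 7% of links
print(sum(v == 0.0 for row in lesioned for v in row))  # -> 7
```

Because the what-word pathway proved more vulnerable, equating severity required unequal proportions: `lesion(w, 0.02)` on the what-word links and `lesion(w, 0.07)` on the hidden-word links yielded the same 63% word accuracy.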
The other dissociation that was examined in the model was the light-heavy verb dissociation. Several theories of verb semantics have argued that light verbs represent basic primitives of sentence meaning (Goldberg, 1995). In the model, I have incorporated these ideas by treating some verbs as the default verb for a construction (these verbs are marked in bold in Table 2). This means that these verbs do not have features in the action event role. For example, the verb "throw" had a feature in the action event role, but the verb "give" did not. Because of this difference in the features in the what-where links, the model should depend more on the message-lexical system for heavy verbs and more on the sequencing system for light verbs. The WWL model produced light verbs correctly 89% of the time, while the HWL model produced them only 53% of the time (F(1,14) = 15.98, p < 0.05). For heavy verbs, the WWL model (42%) was more impaired than the HWL model (77%) (F(1,14) = 37.09, p < 0.001). So the model exhibited a double dissociation for verb complexity, as has been found in the aphasic literature. For function/content word use and light/heavy verb use, the model exhibited double dissociations that have been argued to reflect selective impairment of processing modules in aphasic brains. In the model, these modules were given concrete instantiations and have been shown to work together to produce sentences. One module corresponded to the message-lexical system (impaired in the WWL model), which supported the production of semantically rich information like heavy verbs and content words. The other module was the sequencing system (impaired in the HWL model), which supported categories that were identified with syntactic frames like light verbs and function words. The original motivation for these two modules was the computational demands of getting a connectionist model to generalize more symbolically.
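The F(1, 14) comparisons reported above are one-way ANOVAs over the two groups of eight lesioned models, which for two groups can be sketched as follows; the implementation and the toy input values are illustrative, not the model's actual data or analysis code.

```python
def anova_f(a, b):
    """One-way ANOVA F statistic for two groups, with degrees of freedom
    (1, len(a) + len(b) - 2); with 8 models per lesion type this gives
    the F(1, 14) tests in the text. A textbook sketch, not the original
    analysis code."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    grand = (sum(a) + sum(b)) / (n_a + n_b)
    # between-groups and within-groups sums of squares
    ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    ss_within = (sum((x - mean_a) ** 2 for x in a)
                 + sum((x - mean_b) ** 2 for x in b))
    df_within = n_a + n_b - 2
    return ss_between / (ss_within / df_within)

# toy example with two small groups (df = 1, 4)
print(anova_f([1, 2, 3], [3, 4, 5]))  # -> 6.0
```

With two groups, this F is equivalent to the square of an independent-samples t statistic, so either test would support the same dissociation conclusions.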
But the solution to that problem also nicely accounts for these aphasic dissociations.
